AN IMPROVED FFT FOR A FOUR PARALLEL PIPE SIMD ARITHMETIC PROCESSOR


INTRODUCTION 


The  effectiveness  of  sonar  and  radar  is  being  improved  by  increased 
processing  of  the  received  signals.  The  fast  Fourier  transform  (FFT)  is  a 
basic  building  block  for  various  signal  processing  enhancement  algorithms. 
The computational load imposed by signal bandwidths and number of beams
requires  parallel  processors. 


A militarized signal processor, the AN/UYS-2, that utilizes parallelism
and data flow constructs is being constructed to satisfy the need for
increased computational capability. Size constraints, power and reliability
requirements, etc., along with given internal data transfer rates, dictated
a single-instruction, multiple-data (SIMD), four parallel pipe arithmetic
processor. Connector pin limitations on the boards that are used to package
the processor necessitated use of a common coefficient memory feeding the
four parallel data pipes. This common coefficient memory also is the only
cross-connection between the four pipes.


The  problem  is  to  utilize  the  four  parallel  data  pipes  in  the  given 
architecture  to  perform  the  FFT.  One  method  would  transfer  data  in  four 
independent sets and then perform four parallel FFT's in lock step.
However, this is not compatible with the data flow architecture of the
AN/UYS-2.  Another  method  would  perform  one  FFT  on  one  data  set  four  times 
faster  using  the  four  parallel  pipes  simultaneously. 


This  report  describes  the  second  method,  whereby  an  FFT  computation  on 
a  SIMD  four  parallel  pipe  arithmetic  processor  can  proceed  approximately 
four  times  faster  than  on  a  single  pipe  arithmetic  processor  with  the  same 
instruction  cycle  rate.  The  basic  concept  uses  four  parallel  data 
processing  pipes  to  compute  one  radix-4  FFT.  Only  one  pipe  would  be 
employed  to  perform  the  last  stage  of  the  FFT.  The  first  step  in  explaining 
the concept is to derive a suitable version of the radix-4 FFT. Next,
formulas are derived to show the addition and multiplication time needed to
compute an N-point FFT. Then, the FFT algorithm for using mixed radixes is
given. (This method works almost as well when computing radix-2 or mixed
radix-2 and -4 FFT's.) An efficient scheme for prescrambling the mixed
radixes is also derived. This method can be utilized to compute the inverse
discrete  Fourier  transform. 

AVAILABLE ARCHITECTURE


The  external  architecture  for  the  four  parallel  pipe  arithmetic 
processor  is  shown  in  figure  1,  and  the  internal  architecture  of  a  pipe  is 
shown  in  figure  2.  Each  pipe  has  its  own  operand  memory  that  can  be 
simultaneously  read  and  written  to.  A  limitation  of  this  architecture  is 
that  the  common  coefficient  memory  is  shared  by  all  pipes,  and  crosscoupling 
between  pipes  is  accomplished  by  one  pipe  writing  into  memory  and  another 
pipe  reading  from  that  same  memory. 

With  the  above  parallel  processing  architecture  in  mind,  the  question 
is  how  to  utilize  the  hardware  to  compute  the  FFT.  The  radix-4  FFT  is  shown 
to  be  an  efficient  way  to  compute  the  FFT,  irrespective  of  what  hardware  is 
used, because four data points are multiplied by the vector [1, j, -1, -j],
which reduces the number of complex multiplies required by a factor of about
4 (refs. 1, 2).


CONCEPT  AND  DERIVATION 


In  partitioning  the  FFT  into  four  pipes,  i.e.,  placing  one  fourth  of 
the data points into each of the separate pipes, a decimation-in-time on the
incoming  data  set  can  be  performed.  One  formulation  of  the  radix-4 
decimation-in-time  FFT  does  not  require  cross-connection  between  the  four 
data  sets  until  the  very  last  stage  of  the  FFT.  For  this  last  stage,  the 
independent  results  of  the  four  separate  pipes  are  transferred  into  the 
operand  memory  of  only  one  pipe.  This  transferral  operation  is  necessary 
because  this  stage  of  the  FFT  computation  requires  access  to  all  points  to 
form  the  resulting  frequency  sample. 


DERIVATION  OF  A  RADIX-4  FFT 


The derivation of the radix-4 FFT is based on a decimation-in-time of
the input sequence, x(n), into smaller subsequences. Assume that the number
of input data points is a power of 4, i.e., N = 4^v, and the object will be
to compute the discrete Fourier transform (DFT):

$$ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad W_N = e^{-2\pi j/N}, \quad k = 0, 1, \ldots, N-1; \quad j = \sqrt{-1}. \tag{1} $$
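
As a reference point for the derivation that follows, equation (1) can be evaluated directly. The sketch below is only an illustration in Python (the function name `dft` is an assumption of this example, not part of the processor's software); it costs on the order of N^2 complex multiplies, which is the load the FFT is meant to reduce.

```python
import cmath

def dft(x):
    """Direct evaluation of equation (1): X(k) = sum over n of x(n) * W_N**(n*k)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)          # W_N = e^{-2 pi j / N}
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

# A unit impulse at n = 0 transforms to a flat spectrum.
print(dft([1, 0, 0, 0]))
```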


Figure 1. Architecture of the Four Parallel Pipe Arithmetic Processor

Figure 2. Internal Architecture of One Pipe

Since N is a power of 4, separate the input sequence, x(n), into four
N/4 data point subsequences as follows:

4r, 4r + 1, 4r + 2, 4r + 3,

where

r = 0, 1, ... , (N/4) - 1.


Now write the DFT as

$$ X(k) = \sum_{r=0}^{(N/4)-1} x(4r)\, W_N^{4rk} + \sum_{r=0}^{(N/4)-1} x(4r+1)\, W_N^{(4r+1)k} + \sum_{r=0}^{(N/4)-1} x(4r+2)\, W_N^{(4r+2)k} + \sum_{r=0}^{(N/4)-1} x(4r+3)\, W_N^{(4r+3)k}, \tag{2} $$

k = 0, 1, ... , N - 1.

Next, write the DFT in terms of the four subsequences:

$$ X(k) = \sum_{r=0}^{(N/4)-1} x(4r)\, W_N^{4rk} + W_N^{k} \sum_{r=0}^{(N/4)-1} x(4r+1)\, W_N^{4rk} + W_N^{2k} \sum_{r=0}^{(N/4)-1} x(4r+2)\, W_N^{4rk} + W_N^{3k} \sum_{r=0}^{(N/4)-1} x(4r+3)\, W_N^{4rk}. \tag{3} $$
But

$$ W_N^{4r} = W_{N/4}^{r} \tag{4} $$

because

$$ W_N^{4r} = e^{-2\pi j\, 4r/N} = e^{-2\pi j\, r/(N/4)} = W_{N/4}^{r}. $$


So

$$ X(k) = \sum_{r=0}^{(N/4)-1} x(4r)\, W_{N/4}^{rk} + W_N^{k} \sum_{r=0}^{(N/4)-1} x(4r+1)\, W_{N/4}^{rk} + W_N^{2k} \sum_{r=0}^{(N/4)-1} x(4r+2)\, W_{N/4}^{rk} + W_N^{3k} \sum_{r=0}^{(N/4)-1} x(4r+3)\, W_{N/4}^{rk}. \tag{5} $$


Although the index k ranges over N values, k = 0, 1, ... , N - 1, each
of the sums, G_0(k), G_1(k), G_2(k), and G_3(k), need be computed only for
k = 0, 1, 2, ... , (N/4) - 1 because each is periodic in k with a period of
N/4, where

$$ G_0(k) = \sum_{r=0}^{(N/4)-1} x(4r)\, W_{N/4}^{rk}, \tag{6} $$

$$ G_1(k) = \sum_{r=0}^{(N/4)-1} x(4r+1)\, W_{N/4}^{rk}, \tag{7} $$

$$ G_2(k) = \sum_{r=0}^{(N/4)-1} x(4r+2)\, W_{N/4}^{rk}, \tag{8} $$

and

$$ G_3(k) = \sum_{r=0}^{(N/4)-1} x(4r+3)\, W_{N/4}^{rk}. \tag{9} $$


Note  that  each  of  the  sums  in  the  above  expressions  is  an  N/4  data 
point transform. Also, as k ranges over 0, 1, ... , N - 1, each of the sums
produces only N/4 different values of the N/4 point transforms.

Now rewrite equation (5) as

$$ X(k) = G_0(k) + W_N^{k} G_1(k) + W_N^{2k} G_2(k) + W_N^{3k} G_3(k), \quad k = 0, 1, \ldots, N-1. \tag{10} $$


But, as noted above, each G_i(k), i = 0, 1, 2, 3, is periodic in k of period
N/4. Since k ranges over the values 0, 1, ... , N - 1, partition k into four
sequences, which are given by

m, m + (N/4), m + 2(N/4), m + 3(N/4); m = 0, 1, ... , (N/4) - 1.


Substitute these m values for k in equation (10) to obtain the basic radix-4
kernel:

$$ X(m) = G_0(m) + W_N^{m} G_1(m) + W_N^{2m} G_2(m) + W_N^{3m} G_3(m), \quad m = 0, 1, \ldots, (N/4)-1, \tag{11} $$

$$ X[m + (N/4)] = G_0(m) + W_N^{[m+(N/4)]} G_1(m) + W_N^{2[m+(N/4)]} G_2(m) + W_N^{3[m+(N/4)]} G_3(m), \tag{12} $$

$$ X[m + 2(N/4)] = G_0(m) + W_N^{[m+2(N/4)]} G_1(m) + W_N^{2[m+2(N/4)]} G_2(m) + W_N^{3[m+2(N/4)]} G_3(m), \tag{13} $$

$$ X[m + 3(N/4)] = G_0(m) + W_N^{[m+3(N/4)]} G_1(m) + W_N^{2[m+3(N/4)]} G_2(m) + W_N^{3[m+3(N/4)]} G_3(m). \tag{14} $$


Separating the k frequency terms into four sets, as shown above, and
factoring common powers of W_N reduces the number of multiplications in
equations (11) through (14) from 12 to 3. This reduction is seen by noting
that

$$ W_N^{[m+(N/4)]} = W_N^{m}\, W_N^{(N/4)} = (-j)\, W_N^{m}, \tag{15} $$

$$ W_N^{2[m+(N/4)]} = W_N^{2m}\, W_N^{2(N/4)} = (-1)\, W_N^{2m}, \tag{16} $$

$$ W_N^{3[m+(N/4)]} = W_N^{3m}\, W_N^{3(N/4)} = (j)\, W_N^{3m}, \tag{17} $$

$$ W_N^{[m+2(N/4)]} = (-1)\, W_N^{m}, \tag{18} $$

$$ W_N^{2[m+2(N/4)]} = (1)\, W_N^{2m}, \tag{19} $$

$$ W_N^{3[m+2(N/4)]} = (-1)\, W_N^{3m}, \tag{20} $$

$$ W_N^{[m+3(N/4)]} = (j)\, W_N^{m}, \tag{21} $$

$$ W_N^{2[m+3(N/4)]} = (-1)\, W_N^{2m}, \tag{22} $$

and

$$ W_N^{3[m+3(N/4)]} = (-j)\, W_N^{3m}. \tag{23} $$


Now, where appropriate, substitute equations (15) through (23) into (11)
through (14) to simplify the basic radix-4 kernel in equations (24) through
(27):

$$ X(m) = G_0(m) + W_N^{m} G_1(m) + W_N^{2m} G_2(m) + W_N^{3m} G_3(m), \quad m = 0, 1, \ldots, (N/4)-1, \tag{24} $$

$$ X[m + (N/4)] = G_0(m) + (-j) W_N^{m} G_1(m) + (-1) W_N^{2m} G_2(m) + (j) W_N^{3m} G_3(m), \tag{25} $$

$$ X[m + 2(N/4)] = G_0(m) + (-1) W_N^{m} G_1(m) + (1) W_N^{2m} G_2(m) + (-1) W_N^{3m} G_3(m), \tag{26} $$

$$ X[m + 3(N/4)] = G_0(m) + (j) W_N^{m} G_1(m) + (-1) W_N^{2m} G_2(m) + (-j) W_N^{3m} G_3(m). \tag{27} $$


The  flowchart  for  the  computation  in  equations  (24)  through  (27)  can  be 
drawn  as  shown  in  figure  3. 


Figure 3. Basic Radix-4 Kernel Computation of a Radix-4 FFT (outputs X(m),
X[m + (N/4)], X[m + 2(N/4)], and X[m + 3(N/4)]; m = 0, 1, ... , (N/4) - 1)
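
The kernel of equations (24) through (27) can be sketched in a few lines. This is a hypothetical Python illustration (the name `radix4_kernel` and its argument order are assumptions of this example): the three products with powers of W_N are the only general complex multiplies, and the remaining factors of -1 and j are written as literal factors for readability.

```python
import cmath

def radix4_kernel(g0, g1, g2, g3, m, N):
    """Equations (24)-(27): four outputs from G_0(m), G_1(m), G_2(m), G_3(m)."""
    w = cmath.exp(-2j * cmath.pi * m / N)   # W_N^m
    a = g0
    b = w * g1                              # the three general complex multiplies
    c = w ** 2 * g2
    d = w ** 3 * g3
    return (a + b + c + d,                  # X(m)
            a - 1j * b - c + 1j * d,        # X[m + (N/4)]
            a - b + c - d,                  # X[m + 2(N/4)]
            a + 1j * b - c - 1j * d)        # X[m + 3(N/4)]

# With N = 4 and m = 0 the kernel is a complete four-point DFT.
print(radix4_kernel(1, 2, 3, 4, 0, 4))
```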


Each of the G_i(m) is an (N/4)-point transform, and since N = 4^v, each
of the transforms can be broken down into four (N/16)-point transforms to
create another stage in the computation. This staging can be continued until
the transforms contain only four points; for example, when m = 0 and N = 4
one obtains

$$ X(0) = G_0(0) + W_4^{0} G_1(0) + W_4^{0} G_2(0) + W_4^{0} G_3(0), \tag{28} $$

$$ X(0 + 1) = G_0(0) - j W_4^{0} G_1(0) - W_4^{0} G_2(0) + j W_4^{0} G_3(0), \tag{29} $$

$$ X(0 + 2) = G_0(0) - W_4^{0} G_1(0) + W_4^{0} G_2(0) - W_4^{0} G_3(0), \tag{30} $$

and

$$ X(0 + 3) = G_0(0) + j W_4^{0} G_1(0) - W_4^{0} G_2(0) - j W_4^{0} G_3(0). \tag{31} $$

But

$$ G_0(0) = x(0), \tag{32} $$

$$ G_1(0) = x(1), \tag{33} $$

$$ G_2(0) = x(2), \tag{34} $$

and

$$ G_3(0) = x(3). \tag{35} $$

Equations (28) through (31) can be condensed to

$$ X(k) = x(0) W_4^{0} + W_4^{k} x(1) + W_4^{2k} x(2) + W_4^{3k} x(3), \quad k = 0, 1, 2, 3. \tag{36} $$

Thus, equations (24) through (27) with m = 0 give the same result as taking
a four-point DFT according to equation (1), as shown below:

$$ X(k) = \sum_{n=0}^{3} x(n)\, W_4^{nk}, \quad k = 0, 1, 2, 3, \tag{37} $$

$$ X(0) = x(0) + x(1) + x(2) + x(3), \tag{38} $$

$$ X(1) = x(0) + W_4^{1} x(1) + W_4^{2} x(2) + W_4^{3} x(3), \tag{39} $$

$$ X(2) = x(0) + W_4^{2} x(1) + W_4^{4} x(2) + W_4^{6} x(3), \tag{40} $$

$$ X(3) = x(0) + W_4^{3} x(1) + W_4^{6} x(2) + W_4^{9} x(3), \tag{41} $$

which reduce to

$$ X(0) = x(0) + x(1) + x(2) + x(3), \tag{42} $$

$$ X(1) = x(0) - j\,x(1) - x(2) + j\,x(3), \tag{43} $$

$$ X(2) = x(0) - x(1) + x(2) - x(3), \tag{44} $$

and

$$ X(3) = x(0) + j\,x(1) - x(2) - j\,x(3). \tag{45} $$

The result is that an algorithm that repeatedly uses the basic computation
in figure 3 can be derived.
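
The repeated use of the basic computation can be illustrated with a recursive routine. The following is a minimal Python sketch, assuming the input length is a power of 4 (the name `fft_radix4` is hypothetical); it is written for clarity rather than for in-place operation and is not the processor's microcode.

```python
import cmath

def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    assert N % 4 == 0, "length must be a power of 4"
    # G[i] is the (N/4)-point transform of the subsequence x(4r + i).
    G = [fft_radix4(x[i::4]) for i in range(4)]
    q = N // 4
    X = [0j] * N
    for m in range(q):
        w = cmath.exp(-2j * cmath.pi * m / N)        # W_N^m
        a, b, c, d = G[0][m], w * G[1][m], w ** 2 * G[2][m], w ** 3 * G[3][m]
        X[m] = a + b + c + d                         # equation (24)
        X[m + q] = a - 1j * b - c + 1j * d           # equation (25)
        X[m + 2 * q] = a - b + c - d                 # equation (26)
        X[m + 3 * q] = a + 1j * b - c - 1j * d       # equation (27)
    return X
```

Comparing the output with a direct evaluation of equation (1) for a 16- or 64-point input is a simple way to check the derivation.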


The advantage of deriving equations (24) through (27) is that the
number of multiplications in the basic radix-4 kernel is reduced from 12 to
3 because G_i(m) = (a + jb) multiplied by a power of j reduces to

$$ (a + jb)(j) = -b + ja \tag{46} $$

and

$$ (a + jb)(-j) = b - ja. \tag{47} $$

Also

$$ (a + jb)(-1) = -a - jb. \tag{48} $$

Thus, by interchanging the real and imaginary components or by negating, one
is able to perform the required multiplication.
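
Equations (46) through (48) amount to swaps and sign changes, with no multiplier involved. A small Python sketch (the helper names are hypothetical) makes this explicit:

```python
def times_j(z):
    """Equation (46): (a + jb)(j) = -b + ja."""
    return complex(-z.imag, z.real)

def times_minus_j(z):
    """Equation (47): (a + jb)(-j) = b - ja."""
    return complex(z.imag, -z.real)

def times_minus_one(z):
    """Equation (48): (a + jb)(-1) = -a - jb."""
    return complex(-z.real, -z.imag)

z = complex(3, 4)
print(times_j(z), times_minus_j(z), times_minus_one(z))
```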


The radix-4 decimation-in-time FFT was derived, and a reduction in
multiplications by a factor of 4 was shown. Next will be shown how the
radix-4 FFT is computed on a four pipe arithmetic processor similar to the
one in figures 1 and 2.


ADAPTING  RADIX-4  FFT  TO  A 
FOUR  PIPE  ARITHMETIC  PROCESSOR 


The  object  here  is  to  divide  the  input  sequence  evenly  among  the  four 
pipes and let each pipe perform a radix-4 kernel computation on its input
data  points  independently  of  other  points  in  other  pipes.  See  figure  4  for 
the  flowchart  for  a  16-point  radix-4  FFT.  Since  the  four  parallel 
computations  use  the  same  coefficients  and  use  the  points  in  exactly  the 
same  manner,  the  computation  can  be  carried  out  using  a  SIMD  architecture 
with  a  common  coefficient  memory. 


The important point is that in stage 1 of the FFT each radix-4 kernel
computation  in  a  given  pipe  "calls"  on  the  data  points  only  within  that 
pipe,  and  no  cross-connection  between  operand  memories  is  required. 


Figure 4. Flowchart for a 16-Data-Point Radix-4 FFT

However, stage 2 (the last stage in general) requires access to all the
points of the intermediate results from the previous stage. The
architecture restriction on crosscoupling dictates that a single pipe must
be used for the last stage of the computation. The architecture allows
points to be selectively read from the individual operand memories and
routed to a given pipe, say the first pipe in figure 1.


In the last stage of the FFT the appropriate field in the microword of
the microsequencer controlling the various data paths is set to read G_0(0)
from operand memory 1, G_1(0) from operand memory 2, G_2(0) from operand
memory 3, and finally G_3(0) from operand memory 4 to compute X(0).
Likewise the remaining intermediate results are selected from the individual
pipe operand memories and combined in pipe 1 to produce X(1), X(2), ... ,
X(N - 1).


In general, if N = 4^v, then v - 1 stages of the radix-4 FFT can be
computed in the four parallel pipes. The last stage, the vth, would then
be computed using only one pipe.
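
The division of work between the pipes can be mimicked in software. The sketch below is an assumption-laden Python model, not the AN/UYS-2 microcode: each of four lists stands in for one operand memory, a direct DFT stands in for the v - 1 stages that each pipe performs on its own data, and a single loop plays the role of the one pipe that performs the last stage.

```python
import cmath

def dft(x):
    """Direct DFT, used here only as a stand-in for the v - 1 in-pipe stages."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def four_pipe_fft(x):
    N = len(x)
    q = N // 4
    # Operand memory of pipe p holds the subsequence x(4r + p).
    operand_memory = [list(x[p::4]) for p in range(4)]
    # First v - 1 stages: every pipe touches only its own operand memory.
    G = [dft(mem) for mem in operand_memory]
    # Last stage: one pipe reads G_0(m)..G_3(m) from the four memories.
    X = [0j] * N
    for m in range(q):
        w = cmath.exp(-2j * cmath.pi * m / N)
        a, b, c, d = G[0][m], w * G[1][m], w ** 2 * G[2][m], w ** 3 * G[3][m]
        X[m], X[m + q] = a + b + c + d, a - 1j * b - c + 1j * d
        X[m + 2 * q], X[m + 3 * q] = a - b + c - d, a + 1j * b - c - 1j * d
    return X
```

No data crosses between the four lists until the final loop, which mirrors the restriction that the pipes have no direct cross-connection.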


A measure of the time required for an N-point radix-4 FFT can be
arrived at as follows: Let N = 4^v; thus, the FFT requires log_4(N)
stages for the FFT computation. By looking at the basic radix-4 kernel,
given in equations (11) through (14), and its flowchart in figure 3, it is
seen that 3 complex multiplies and 12 complex adds are required per kernel.
Also, note that N/4 kernels must be computed for each stage. If a unit of
time is required for one complex multiply, then the multiply time for an N =
4^v point radix-4 FFT is given by

$$ \frac{3[\log_4(N) - 1](N/4)}{4} + 3(N/4) = \frac{3N}{4}\left[\frac{\log_4(N) - 1}{4} + 1\right], \tag{49} $$

where

$$ \frac{3[\log_4(N) - 1](N/4)}{4} $$

is the time required to compute the first v - 1 stages in parallel and
3(N/4) is the multiplication time to compute the last stage. Similarly, the
time for computing the complex adds is given by

$$ \frac{12[\log_4(N) - 1](N/4)}{4} + 12(N/4) = 3N\left[\frac{\log_4(N) - 1}{4} + 1\right]. $$


By judicious use of the architecture within a given pipe, as shown in
figure 2, one is able to reduce the addition time by "pipelining in time"
the FFT computations and performing two parallel additions for each
multiplication (ref. 1). (The phrase "pipelining in time" means that three
sets of latches are interposed in the path through a given arithmetic
processor (one of four parallel pipes) so that after three clock periods the
pipe is full, and addition and multiplication are overlapped.) In other
words, the time for computing a given FFT is proportional to the time for
multiplication, which is given by

$$ \text{Time} = \frac{3N}{4}\left[\frac{\log_4(N) - 1}{4} + 1\right], $$


where  one  can  see  that  as  N  becomes  larger  the  multiplication  time  decreases 
to  approximately  one  fourth  of  that  when  using  a  single  pipe.  Thereby,  one 
FFT  can  be  computed  approximately  four  times  faster  by  using  the  four 
parallel  pipes. 
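
The multiply-time expressions are easy to tabulate. The Python sketch below (function names are assumptions of this example) evaluates the single-pipe time, 3(N/4) log_4(N), and the four-pipe time of equation (49) in units of one complex multiply.

```python
def stages(N):
    """Return log_4(N) for N an exact power of 4."""
    v = 0
    while N > 1:
        assert N % 4 == 0, "N must be a power of 4"
        N //= 4
        v += 1
    return v

def multiply_time_single(N):
    """All log_4(N) stages in one pipe: 3(N/4) multiplies per stage."""
    return 3 * (N // 4) * stages(N)

def multiply_time_parallel(N):
    """Equation (49): v - 1 stages shared by four pipes, last stage in one pipe."""
    return 3 * (stages(N) - 1) * (N // 4) // 4 + 3 * (N // 4)

for N in (256, 1024, 4096):
    s, p = multiply_time_single(N), multiply_time_parallel(N)
    print(N, s, p, round(s / p, 1))
```

For N = 256, 1024, and 4096 these give 768 and 336, 3,840 and 1,536, and 18,432 and 6,912 time units, which agree with the corresponding entries in table 1.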

Next, some practical considerations in implementing this radix-4
algorithm on a four-parallel-pipe architecture are given; i.e., the size of
the FFT, size of the operand memory, mixed-radix FFT's, speed formulas, bit
reversal required to re-order input data, and inverse FFT's are discussed.


PRACTICAL  CONSIDERATIONS 


FFT  SIZE 


The size of the FFT determines how efficiently the four parallel pipes
can be used. Generally, the more data points there are to transform, the
less effect using only one pipe for the last stage has on overall
computation time. The times required to transform several FFT sample sizes,
using both a single- and a parallel-pipe FFT, are given in table 1, where it
can be seen that the scheme approaches the theoretical speedup of a factor
of 4.

Table 1. Efficiency of a Four Parallel Pipe FFT
As a Function of Transform Size

Size of FFT   Time for          Time for            Ratio of Single- to
              Single-Pipe FFT   Parallel-Pipe FFT   Parallel-Pipe Scheme
-----------   ---------------   -----------------   --------------------
        256               768                 336                    2.3
      1,024             3,840               1,536                    2.5
      4,096            18,432               6,912                    2.7
     16,384            86,016              27,648                    3.1


Another characteristic of FFT size is that it must be a power of 4.
Efficiency is increased by this restriction because the powers of W_4 are
located at the 90-deg quadrant points of the unit circle, and most
multiplications reduce to multiplying by +1, +j, -1, or -j. However, this
parallel processing method can be adapted to work on data sizes that are a
power of 2.


MIXED-RADIX  FFT's  AND  SPEED  FORMULAS 


Acoustic signal processing generally uses time-bandwidth products where
512, 1024, 2048, or 4096 points are to be transformed. Since 512 and 2048
are powers of 2 and not 4, a mixed-radix FFT can be employed to retain most
of the FFT efficiency by writing 512 as 2 · 4^4 and 2048 as 2 · 4^5. For
the 512 sample FFT, one radix-2 stage and four radix-4 stages must be
performed to complete the computation. Similarly, for the 2048 computation,
one radix-2 stage and five radix-4 stages are required.


When performing the mixed-radix FFT on the SIMD architecture in figure
1, the radix-2 stage can be performed first or last. A slight increase in
speed can be achieved when the radix-2 kernels are used in the last stage.
The reason for this can be explained using a 32 = 2 · 4^2 data point FFT as
an example.


The example FFT is performed by computing a radix-4 first stage, a
radix-4 second stage, and finally a radix-2 third (last) stage (figure 5).
First, the 32 original data points are divided equally among the 4 pipes so
that each pipe has 8 points. Two radix-4 kernels are computed in each of
the four parallel pipes in the first stage. Now, the second stage radix-4
computations are performed, but for this stage points from pipes one and two
are crosscoupled. Because of the crosscoupling restriction, half of the
second stage computations must be performed in only one pipe, say pipe one.
Similarly, the points in pipes three and four must be combined into one
pipe, say pipe three. Therefore, only two of the possible four pipes can be
used in the second stage FFT computation.

Figure 5. Partial Flowchart for a 32-Data-Point FFT Using
Two Radix-4 Stages Followed by One Radix-2 Stage


The third stage computation presents the same problem as the second
stage because data points from pipes 1 and 3 must be combined in only one
pipe, say pipe one. Ordering the stages as 4, 4, and 2 prevents the
four pipes from being used in parallel for the last two stages. Also, in
the last stage, the radix-2 FFT will require multiplication by 32/2 = 16
different powers of W_32. However, the first stage multiplication
coefficients reduce to powers of j, which speeds up the computation.


Alternatively,  we  can  reorder  the  32  data-point  FFT  stages,  as  shown  in 
figure  6,  so  the  radix-2  stage  is  performed  first,  followed  by  two  radix-4 
stages. The first stage radix-2 kernel computations are performed in the
four parallel pipes, with each pipe computing four 2-point FFT's. The
second stage radix-4 kernel computations can again be performed in the four
parallel  pipes  because  each  computation  requires  only  those  points  already 
in  its  own  operand  memory.  The  result  is  that  four  8-point  FFT's  are 
performed  in  parallel  in  the  second  stage. 


The third stage is the only computation that needs data points from the
other pipes. Therefore, the third stage radix-4 kernel computations must be
performed in only one pipe, say pipe one. This arrangement permits two
stages of the FFT to be done using the four parallel pipes, and only the
last stage must be done using one pipe. Also, the radix-2 coefficients
needed in the first stage are +1 or -1, thereby eliminating multiplication
by powers of W_32. In general, whenever radix-2 and -4 stages are
required, the radix-2 stage should be performed last.


Formulas  for  the  number  of  multiplications  involved  for  mixed  radixes 
are  given  next. 


When N is equal to two times some power of four (i.e., N = 2 · 4^V),
the complex multiply time can be determined by allowing for 3(N/4)
multiplies for each radix-4 stage and N/2 multiplies for each radix-2
stage. If any stage is performed in four parallel pipes the time to compute
the complex multiplies should be divided by four. The only exception
applies to any radix-2 or -4 first stage. The first stage doesn't contain
any multiplies because the coefficients will be ±1 for radix-2 and ±1 or ±j
for radix-4.

Figure 6. Partial Flowchart for a 32-Data-Point FFT Using One
Radix-2 Stage Followed by Two Radix-4 Stages

A formula will now be derived for a mixed-radix FFT when radix-4 stages
are performed first and the radix-2 stage is performed last. In this case,
there are (V - 2) radix-4 stages with multiplying coefficients done in four
pipes, one radix-4 stage done in two pipes, and one radix-2 stage done in
one pipe. The total number of complex multiplies is determined by
multiplying the number of stages by the number of parallel multiplies per
stage. Thus, the

$$ \text{NCM (number of complex multiplies)} = (V - 2)\,\frac{1}{4}\,\frac{3N}{4} + \frac{1}{2}\,\frac{3N}{4} + \frac{N}{2}, $$

which reduces to

$$ \text{NCM} = \frac{N}{16}\,(3V + 8). $$


Similarly, when the radix-2 stage is performed first, there are (V - 1)
radix-4 stages done in four parallel pipes, one radix-4 stage done in a
single pipe, and a radix-2 stage done in four pipes, which has no
multiplying coefficients other than ±1. The total number of serial complex
multiplies, considering the parallelism of the four pipes, is

$$ \text{NCM} = \frac{1}{4}\,(V - 1)\,\frac{3N}{4} + \frac{3N}{4}, $$

which reduces to

$$ \text{NCM} = \frac{N}{16}\,(3V + 9). $$

The difference between the NCM for the two formulas, above, is N/16.


This  difference  is  not  a  significant  percentage  of  the  number  of  complex 
multiplies.  The  stage  order  is  really  driven  by  the  complexity  involved  in 
shuffling  data  between  pipes  since  the  first  case  requires  shuffling  for  two 
stages. 
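
The two stage orderings can be compared numerically. The following Python sketch (function names are assumptions of this example) adds up the per-stage multiply times exactly as described above for N = 2 · 4^V, using exact fractions so that the closed forms (N/16)(3V + 8) and (N/16)(3V + 9) can be checked without rounding.

```python
from fractions import Fraction

def ncm_radix2_last(V):
    """Radix-4 stages first, radix-2 stage last; N = 2 * 4**V."""
    N = 2 * 4 ** V
    r4 = Fraction(3 * N, 4)        # multiplies in one radix-4 stage
    # (V - 2) stages in four pipes, one stage in two pipes, radix-2 in one pipe
    return (V - 2) * r4 / 4 + r4 / 2 + Fraction(N, 2)

def ncm_radix2_first(V):
    """Radix-2 stage first (no multiplies), then radix-4 stages."""
    N = 2 * 4 ** V
    r4 = Fraction(3 * N, 4)
    # (V - 1) stages in four pipes, last radix-4 stage in a single pipe
    return (V - 1) * r4 / 4 + r4

for V in (2, 4, 5):                # N = 32, 512, 2048
    N = 2 * 4 ** V
    print(N, ncm_radix2_last(V), ncm_radix2_first(V))
```

For the 32-point example (V = 2) this gives 28 and 30 multiply times, a difference of N/16 = 2.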

OPERAND MEMORY


Another advantage of the radix-4 parallel pipe arrangement when
computing the FFT is that three of the operand memories need hold only one
quarter of the total sample size. However, the operand memory, which serves
to combine the four individual results in the last stage, must be large
enough to hold all N points. The computation scheme described is
essentially an in-place scheme. Four temporary locations must be allocated
for the four values, G_0(m), G_1(m), G_2(m), and G_3(m), and their
values must be saved each time m is varied from 0 to (N/4) - 1 in each stage.


BIT  REVERSAL  TO  RE-ORDER  INPUT  DATA 


This section discusses why and how the input data sequence should be
re-ordered before proceeding with the FFT computations. By re-ordering the
input sequence,

• the FFT computations require less temporary storage,
• the resulting output sequence is in its natural order, and
• the address generation required to index to the proper data points
  and coefficients is simplified.


The  discussion  on  re-ordering  begins  with  the  simple  radix-2,  then  the 
radix-4,  and  finally  the  mixed  radix,  radix-2  and  -4. 


In the derivation of the radix-2 decimation-in-time algorithm (ref. 2), the
original  input  sequence  is  first  sorted  into  even  and  then  odd  numbered  data 
points.  Next,  the  even  numbered  points  are  sorted  into  an  even  and  an  odd 
group.  Similarly,  the  odd  points  from  the  first  sort  are  also  sorted  into 
an  even  and  an  odd  group.  The  sorting  process,  breaking  each  group  into  new 
even  and  odd  groups,  continues  in  each  stage  until  there  are  only  two  points 
left  in  each  group,  and  they  are  already  sorted  into  even  and  odd  because 
there  are  only  two  points  in  a  set. 


An  example  of  this  process  is  shown  in  table  2,  where  the  original 
binary  ordering,  the  ordering  after  the  first  sort,  and  the  ordering  after 
the  second  sort  are  listed.  A  third  sort  is  not  necessary. 

After the first sort all the even data points are in the first half of
the sorted sequence, and the odd points are in the second half of the
sequence. The same result is obtained by considering the least significant
bit, J0, in the original binary ordering to be the most significant bit,
and the most significant two bits, J2 J1, to be the least significant
ones, i.e., (J0 J2 J1)_2. The second sort in table 2 separates the
even points from the first sort into even and odd points, and the odd points
from the first sort are also separated into even and odd points. Observe
that the same result is obtained by making the least significant bit, J1,
of the J2 J1 bit pair the most significant bit to obtain the pair
J1 J2. As there are only two points in the resulting subsequence and
they are already in even and then odd order, a third sort is not necessary.


After sorting, the location of an original data point can be determined
by performing the well-known bit reversal procedure for radix-2 FFT's. That
is, if (J2 J1 J0)_2 is the binary representation of an original point,
that point ends up in location (J0 J1 J2)_2 following the sorting in
the various stages required to compute the FFT.
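
The equivalence between repeated even/odd sorting and bit reversal can be checked with a short Python sketch (the function names are hypothetical and the 8-point size is chosen only for illustration):

```python
def even_odd_sort(seq):
    """Recursively sort a sequence into even-numbered, then odd-numbered, points."""
    if len(seq) <= 2:
        return list(seq)              # two points are already in even/odd order
    return even_odd_sort(seq[0::2]) + even_odd_sort(seq[1::2])

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i: (J2 J1 J0) becomes (J0 J1 J2)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

order = even_odd_sort(list(range(8)))
print(order)                                   # [0, 4, 2, 6, 1, 5, 3, 7]
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```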


The above process is generalized for radix-4 by sorting the original
N-point sequence into four subsequences composed of the data points 4r,
4r + 1, 4r + 2, and 4r + 3, r = 0, 1, ..., (N/4) - 1. Then, each of the four
resulting subsequences is independently sorted the same way; i.e.,
subsequence 4r is sorted into 4q, 4q + 1, 4q + 2, and 4q + 3, q = 0, 1, ...,
(N/16) - 1.


The sorting process continues for log_4(N) stages. The original data
can be re-ordered by reversing the base-four digits or by pair-wise
reversing the binary bit representations of the sample index. For example,
if (J3 J2 J1 J0)_2 is the binary representation of the original
sequence, the sorted sequence will be ordered as (J1 J0 J3 J2)_2.

If the original set of data points contains N = 4^3 points, then three
sorts are required, although the last sort need not be performed as there
would be only 4 points in the resulting 16 subsequences to sort, and they
are already automatically sorted. If the binary representation of the
original sequence were (J5 J4 J3 J2 J1 J0)_2, the points would end
up in the location represented by (J1 J0 J3 J2 J5 J4)_2, from which
the radix-4 FFT would be performed.
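
For radix-4 the same check can be made with base-four digits, i.e., with bit pairs kept together. This is a minimal Python sketch under the assumption that N is a power of 4; the names `sort_by_fours` and `digit_reverse_base4` are illustrative only.

```python
def sort_by_fours(seq):
    """Recursively sort into the subsequences 4r, 4r + 1, 4r + 2, 4r + 3."""
    if len(seq) <= 4:
        return list(seq)              # four points are already sorted
    return [p for k in range(4) for p in sort_by_fours(seq[k::4])]

def digit_reverse_base4(i, digits):
    """Reverse base-4 digits: (J5 J4 | J3 J2 | J1 J0) -> (J1 J0 | J3 J2 | J5 J4)."""
    r = 0
    for _ in range(digits):
        r = (r << 2) | (i & 3)        # move one bit pair at a time
        i >>= 2
    return r

print(sort_by_fours(list(range(16))))
print([digit_reverse_base4(i, 2) for i in range(16)])
```

Both lines print the same ordering, beginning 0, 4, 8, 12, 1, 5, 9, 13.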


For the mixed-radix FFT, the sorted ordering of the data can be
obtained by reversing either bits or pairs of bits. As an example, suppose
32 = 2 · 4^2 data points are to be transformed by performing a
decimation-in-time FFT having a radix-2 stage first, followed by two radix-4
stages. Note that in the derivation of a radix-2/-4/-4 FFT, a radix-4 stage
would be derived followed by another radix-4 stage and then a radix-2
stage. However, the radix-2 stage would be computed first, followed by the
two radix-4 stages. Let (J4 J3 J2 J1 J0)_2 be the binary
representation of the original data sequence. The input data would be
sorted by 4's, then by 4's again, and finally by 2's into even and odd
points. The reversal procedure would be (J2 J1 J0 J4 J3)_2, then
(J0 J2 J1 J4 J3)_2, where the entities J4 J3 and J2 J1 are treated
as pairs that remain in fixed positions relative to each other.
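
The 32-point reversal can be written out bit by bit. In this hypothetical Python sketch the index (J4 J3 J2 J1 J0)_2 is split into the pair J4 J3, the pair J2 J1, and the single bit J0, and reassembled as (J0 J2 J1 J4 J3)_2 with each pair kept in its internal order.

```python
def mixed_reverse_32(i):
    """Map (J4 J3 | J2 J1 | J0) to (J0 | J2 J1 | J4 J3) for a 2 * 4 * 4 = 32 point FFT."""
    j0 = i & 1                  # single radix-2 bit
    j21 = (i >> 1) & 3          # middle radix-4 pair
    j43 = (i >> 3) & 3          # upper radix-4 pair
    return (j0 << 4) | (j21 << 2) | j43

print([mixed_reverse_32(i) for i in range(8)])   # [0, 16, 4, 20, 8, 24, 12, 28]
```

The mapping is a permutation of 0 through 31, so every input point receives a distinct location.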


The above re-ordering is necessary prior to any computation so that the
FFT can be carried out "in place." If the flow diagram is drawn for each
stage of the FFT, the data points on the same horizontal level transform
into points on that same horizontal level. No temporary memory need be
allocated except for two locations in the radix-2 kernel and four in the
radix-4 kernel.


The net effect of performing the radix-2 kernel (butterfly) or radix-4
kernel (dragonfly) on re-ordered data is that the resulting frequency data
points  of  the  FFT  end  up  in  their  correct  order  and  the  intermediate  memory 
locations  can  be  overwritten  during  each  stage  of  the  FFT  computation. 


The  digit-reversal  procedure  can  take  place  in  hardware  simply  by 
building  a  binary  counter  that  counts  from  the  left  (most  significant  bit) 
or  by  transposing  the  wires  from  a  normal  counter,  as  shown  in  figure  7. 
This  counter  is  used  to  index  into  the  original  data  set  to  obtain  the 
properly  re-ordered  data  points. 
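Both hardware options have simple software analogues for the plain
bit-reversal (radix-2) case. The following is a minimal sketch, not a
description of the circuit in figure 7; the function names
reversed_increment and transpose_wires are hypothetical.

```python
def reversed_increment(value, nbits):
    """Advance a counter that counts from the most significant bit.

    The carry propagates from left to right: set bits are cleared until
    the first clear bit is found, and that bit is then set.
    """
    bit = 1 << (nbits - 1)
    while bit and (value & bit):
        value ^= bit        # clear this bit and carry to the right
        bit >>= 1
    return value | bit      # bit is 0 when the counter wraps around to 0


def transpose_wires(n, nbits):
    """Bit-reverse the output of a normal counter (the 'transposed wires')."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r


# Step the left-counting counter through one full cycle of 8 addresses.
sequence = [0]
for _ in range(7):
    sequence.append(reversed_increment(sequence[-1], 3))
```

For a 3-bit counter, both approaches give the address sequence
0, 4, 2, 6, 1, 5, 3, 7.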


The digit-reversal counter can be generalized to any radix (not
necessarily powers of 2) by using an adder circuit that causes a register to
step through a bit-reversed sequence of any radix, as shown in figure 8.
Here the adder is binary, but the adder increments are chosen to simulate a
counter of any mixed radix desired, like 3, 5, 4, where the number of data
points is factorable as N = (3)(5)(4).
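One way to picture such an adder-driven register in software is shown
below. It is a sketch under stated assumptions, not a model of figure 8:
the increments (digit weights) are chosen here so that the digit that
changes fastest in a natural count carries the largest weight in the
address, and the name digit_reversed_sequence is hypothetical.

```python
def digit_reversed_sequence(radices):
    """Yield the digit-reversed address sequence using only add/subtract.

    `radices` lists the digit bases of the natural counter from most
    significant to least significant (an assumed convention).
    """
    weights = []
    w = 1
    for r in radices:            # most significant natural digit gets weight 1
        weights.append(w)
        w *= r
    total = w                    # N, the product of all radices
    digits = [0] * len(radices)
    addr = 0
    for _ in range(total):
        yield addr
        # Increment the natural count from its least significant digit,
        # which is the most significant digit of the reversed address.
        for pos in reversed(range(len(radices))):
            digits[pos] += 1
            addr += weights[pos]                  # chosen adder increment
            if digits[pos] < radices[pos]:
                break
            addr -= radices[pos] * weights[pos]   # digit overflow: reset it
            digits[pos] = 0                       # and carry to the next digit


addresses_60 = list(digit_reversed_sequence([3, 5, 4]))
```

For N = (3)(5)(4) = 60 the sequence starts 0, 15, 30, 45, 3 and visits each
of the 60 addresses exactly once. With radices [2, 2, 2] it reduces to the
ordinary bit-reversed order.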

Sample sizes that are not powers of 2 are generally not necessary (reference 3)
and create awkward hardware.


INVERSE  DFT's 


The exact same parallel processing FFT algorithm and hardware discussed
here can be used to compute the inverse DFT, which is given by

$$x(k) = \frac{1}{N} \sum_{i=0}^{N-1} X(i)\, e^{\,j 2\pi i k / N}$$


The inverse DFT can be obtained using the forward transform
coefficients and by

1. conjugating $X(i)$ to obtain $\overline{X}(i)$,

2. performing the forward transform to obtain

$$\hat{x}(k) = \sum_{i=0}^{N-1} \overline{X}(i)\, e^{-j 2\pi i k / N},$$

3. dividing $\hat{x}(k)$ by N and conjugating the result to obtain
$x(k)$.


In other words, to obtain the inverse DFT, simply conjugate the input data
points, take the forward transform, divide by N, and conjugate the result.
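The conjugation procedure can be checked numerically with a direct
(non-fast) DFT. The sketch below assumes the forward transform uses the
exponent sign -j2(pi)ik/N; the function names forward_dft and
inverse_via_forward are hypothetical, and any forward FFT routine could
stand in for the direct sum.

```python
import cmath


def forward_dft(x):
    """Direct forward DFT (assumed sign convention: exp(-j*2*pi*i*k/N))."""
    N = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / N) for i in range(N))
        for k in range(N)
    ]


def inverse_via_forward(X):
    """Inverse DFT computed with the forward transform only."""
    N = len(X)
    conjugated = [complex(v).conjugate() for v in X]    # step 1: conjugate input
    transformed = forward_dft(conjugated)               # step 2: forward transform
    return [(v / N).conjugate() for v in transformed]   # step 3: scale, conjugate


# Round trip: transform a short sequence and recover it.
samples = [1 + 2j, -1j, 3 + 0j, 0.5 - 0.5j]
recovered = inverse_via_forward(forward_dft(samples))
```

The recovered values agree with the original samples to within
floating-point rounding, so no separate set of inverse coefficients is
required.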


CONCLUSION 


A method was devised to efficiently utilize a loosely interconnected,
SIMD four parallel pipe arithmetic processor with a single common
coefficient memory to compute the FFT. Previously, users of this
architecture employed the four pipes to execute four independent FFT's on
four independent data sets, thereby limiting its use.


The method described here uses the four parallel pipes to compute a
single FFT approximately four times faster than that possible in a single
pipe arithmetic processor. A decimation-in-time radix-4 FFT algorithm that
is adapted to allow the computation to proceed on the four parallel pipes is
derived. Basically, the original N-point data set is separated into four
N/4-point data sets, and each of these points is partially transformed in each
of the four pipes. The four pipes operate in parallel until the last stage,
at which time one pipe must be used to finish the computation.


Formulas  showing  the  addition  and  multiplication  time  needed  to  compute 
an  N-point  radix-4  FFT  are  derived  and  described  here.  A  comparison  of 
computing  an  FFT  in  a  single  pipe  versus  four  parallel  pipes  is  made  to  show 
that  a  fourfold  decrease  in  execution  time  is  possible  with  the  four 
parallel  pipes.  Also,  a  method  is  shown  for  computing  an  FFT  whose  length 
is  not  a  power  of  4.  An  example  of  mixing  radix-2  and  -4  stages  shows  that 
the  radix-2  stage  can  be  performed  first  or  last.  Formulas  for  the 
execution  time  of  the  mixed  radix  FFT  computation  are  given. 


It is advantageous to re-order the input data in the described FFT
scheme because "in-place" computations can save memory and also the output
appears in the correct order. In addition, for an N-point transform, three
operand memories need be N/4 words, and one operand memory must be N words
in size.


A generalized method for re-ordering the data is presented that allows
data to be prescrambled for radix-2, -4, or mixed radixes. This re-ordering
method uses an arrangement of a binary counter that counts from the most
significant rather than from the least significant bit. The output of this
counter can be used to index naturally ordered input data points and locate
the appropriate scrambled data points for the radix-2 (butterfly) or the
radix-4 (dragonfly) FFT computations.


Finally, a method is shown that allows the same computations to produce
the inverse DFT. The computation method described here can be generalized
to any radix FFT and to any number of parallel processing data pipes in an
arithmetic processor.